A hypothesis class $H$ is PAC-learnable if there exists a learning algorithm with the following property: for every $\epsilon,\delta\in(0,1)$, every distribution $D$ over $X$, and every $h\in H$, when the algorithm is given $m(\epsilon,\delta)$ i.i.d. samples drawn from $D$ and labeled by $h$, it produces a hypothesis $\hat{h}$ such that, with probability at least $1-\delta$, $L(\hat{h})\le\epsilon$. (Note that the probability is over the randomness in the training set as well as any internal algorithmic randomness.)
Theorem 1 (PAC bound for finite hypothesis class)
Let $H$ be a hypothesis class of finite size $|H|$. Then $H$ is PAC-learnable with sample complexity
$$m(\epsilon,\delta) = O\!\left(\frac{1}{\epsilon}\log\frac{|H|}{\delta}\right).$$
The proof is shown in the previous note.
Solving for $m$: if $m \ge \frac{1}{\epsilon}\log\frac{|H|}{\delta}$, then the probability of failure in PAC learning is at most $\delta$, and hence we succeed with probability at least $1-\delta$.
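As a quick numeric check, here is a minimal sketch of this bound in Python (assuming natural log and the constant 1 from the standard union-bound proof; the hidden constant in the $O(\cdot)$ above may differ):

```python
import math

def pac_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Realizable-case bound: m >= (1/eps) * ln(|H| / delta).

    The constant comes from the usual union-bound argument;
    treat the exact value as illustrative, not tight.
    """
    return math.ceil((1.0 / eps) * math.log(h_size / delta))

# e.g. |H| = 1000 hypotheses, accuracy eps = 0.05, confidence delta = 0.01
print(pac_sample_complexity(1000, 0.05, 0.01))  # 231
```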
A hypothesis class $H$ is agnostic PAC learnable if there exist a function $m_H:(0,1)^2\to\mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon,\delta\in(0,1)$ and for every distribution $D$ over $X\times Y$, when running the learning algorithm on $m\ge m_H(\epsilon,\delta)$ i.i.d. examples generated by $D$, the algorithm returns a hypothesis $h$ such that, with probability of at least $1-\delta$ (over the choice of the $m$ training examples),
$$L_D(h) \le \min_{h'\in H} L_D(h') + \epsilon,$$
where $L_D$ denotes the true risk under $D$.
Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative (w.r.t. domain $Z$, hypothesis class $H$, loss function $\ell$, and distribution $D$); that is, $|L_S(h)-L_D(h)|\le\frac{\epsilon}{2}$ for every $h\in H$. Then, any output of $\mathrm{ERM}_H(S)$, namely, any $h_S\in\operatorname{argmin}_{h\in H} L_S(h)$, satisfies
$$L_D(h_S) \le \min_{h\in H} L_D(h) + \epsilon.$$
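The proof is a short sandwich argument, applying representativeness twice and the definition of ERM once: for every $h\in H$,
$$L_D(h_S) \le L_S(h_S) + \frac{\epsilon}{2} \le L_S(h) + \frac{\epsilon}{2} \le L_D(h) + \frac{\epsilon}{2} + \frac{\epsilon}{2} = L_D(h) + \epsilon.$$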
We say that a hypothesis class $H$ has the uniform convergence property (w.r.t. a domain $Z$ and a loss function $\ell$) if there exists a function $m_H^{UC}:(0,1)^2\to\mathbb{N}$ such that for every $\epsilon,\delta\in(0,1)$ and for every probability distribution $D$ over $Z$, if $S$ is a sample of $m\ge m_H^{UC}(\epsilon,\delta)$ examples drawn i.i.d. according to $D$, then, with probability of at least $1-\delta$, $S$ is $\epsilon$-representative.
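To make the definition concrete, here is a small Monte-Carlo sketch that estimates how often a sample fails to be $\epsilon$-representative for a toy finite class (the threshold classifiers, the noise model, and all parameter values below are illustrative choices, not anything fixed by these notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: X = [0, 1), H = threshold classifiers h_t(x) = 1[x >= t]
# on a grid of 21 thresholds, labels from t = 0.3 with 10% label noise,
# and the 0-1 loss.
thresholds = np.linspace(0, 1, 21)
true_t, noise = 0.3, 0.1

def draw(m):
    x = rng.random(m)
    y = (x >= true_t).astype(int)
    flip = rng.random(m) < noise
    return x, np.where(flip, 1 - y, y)

def losses(x, y):
    # empirical 0-1 loss of every h in H on the sample (x, y)
    preds = x[None, :] >= thresholds[:, None]
    return (preds != y[None, :]).mean(axis=1)

# approximate the true losses L_D(h) with one very large sample
L_true = losses(*draw(200_000))

eps, m, trials = 0.1, 200, 1000
bad = sum(np.max(np.abs(losses(*draw(m)) - L_true)) > eps
          for _ in range(trials))
print(f"estimated P(S is not {eps}-representative): {bad / trials:.3f}")
```

Increasing $m$ drives this failure probability down, which is exactly what the function $m_H^{UC}$ quantifies.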
If a class $H$ has the uniform convergence property with a function $m_H^{UC}$, then the class is agnostically PAC learnable with the sample complexity $m_H(\epsilon,\delta)\le m_H^{UC}(\epsilon/2,\delta)$. Furthermore, in that case, the $\mathrm{ERM}_H$ paradigm is a successful agnostic PAC learner for $H$.
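Since $\mathrm{ERM}_H$ is the learner this result certifies, here is a minimal sketch of ERM over a finite hypothesis class (the hypothesis class and the toy sample below are placeholders for illustration):

```python
def erm(hypotheses, S):
    """Return a hypothesis minimizing the empirical 0-1 loss on S."""
    def emp_loss(h):
        return sum(h(x) != y for x, y in S) / len(S)
    return min(hypotheses, key=emp_loss)

# Illustrative finite class: threshold classifiers on [0, 1].
H = [lambda x, t=t: int(x >= t) for t in (0.1 * k for k in range(11))]
S = [(0.05, 0), (0.2, 0), (0.4, 1), (0.7, 1), (0.9, 1)]
h_S = erm(H, S)  # any empirical minimizer satisfies the lemma above
```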
In (3), we showed that $H$ has the uniform convergence property, and thus, by the uniform convergence result above, $H$ is agnostically PAC learnable with
$$m \ge \left\lceil \frac{2\log(2|H|/\delta)}{\epsilon^2} \right\rceil$$
samples.
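For contrast with the realizable case, a sketch of this agnostic bound (same caveats as before: natural log and illustrative constants):

```python
import math

def agnostic_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Agnostic bound: m >= ceil(2 * ln(2|H| / delta) / eps^2)."""
    return math.ceil(2.0 * math.log(2 * h_size / delta) / eps**2)

# Same |H|, eps, delta as before; the 1/eps^2 dependence makes m much larger.
print(agnostic_sample_complexity(1000, 0.05, 0.01))  # 9765
```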